Uncovering associations between pre-existing conditions and COVID-19 Severity: A polygenic risk score approach across three large biobanks

Objective To overcome the limitations associated with the collection and curation of COVID-19 outcome data in biobanks, this study proposes the use of polygenic risk scores (PRS) as reliable proxies of COVID-19 severity across three large biobanks: the Michigan Genomics Initiative (MGI), UK Biobank (UKB), and NIH All of Us. The goal is to identify associations between pre-existing conditions and COVID-19 severity. Methods Drawing on a sample of more than 500,000 individuals from the three biobanks, we conducted a phenome-wide association study (PheWAS) to identify associations between a PRS for COVID-19 severity, derived from a genome-wide association study on COVID-19 hospitalization, and clinical pre-existing, pre-pandemic phenotypes. We performed cohort-specific PRS PheWAS and a subsequent fixed-effects meta-analysis. Results The current study uncovered 23 pre-existing conditions significantly associated with the COVID-19 severity PRS in cohort-specific analyses, of which 21 were observed in the UKB cohort and two in the MGI cohort. The meta-analysis yielded 27 significant phenotypes predominantly related to obesity, metabolic disorders, and cardiovascular conditions. After adjusting for body mass index, several clinical phenotypes, such as hypercholesterolemia and gastrointestinal disorders, remained associated with an increased risk of hospitalization following COVID-19 infection. Conclusion By employing PRS as a proxy for COVID-19 severity, we corroborated known risk factors and identified novel associations between pre-existing clinical phenotypes and COVID-19 severity. Our study highlights the potential value of using PRS when actual outcome data may be limited or inadequate for robust analyses.

1.1. First, the COVID GWAS contains UKBB EUR samples. These samples, which make up one third of the whole discovery GWAS sample size, were then used as the target data in the PheWAS analysis, and most of the signals are from UKBB EUR samples. If the phenotype used in the discovery GWAS and in the target data prediction were the same, the overfitting would certainly be very severe. Although the phenotypes used in the discovery GWAS and the PheWAS are different, they can still be correlated and therefore cause overfitted PheWAS results. The authors should rule out the possibility that the signals from the UKBB EUR samples result from overfitting, especially since almost all the significant results are from the UKBB EUR samples. Since the PheWAS in EUR samples is the main finding of this project, all the following analyses and discussion would be questionable if the validity of this result cannot be confirmed. One solution could be to collaborate with the COVID GWAS consortium and obtain a GWAS without UKBB samples.
We thank the reviewer for highlighting the importance of avoiding overfitting due to sample overlap. We were conscious of this potential issue from the outset and intentionally selected GWAS summary statistics that did not include UK Biobank samples for the PRS analysis of the UK Biobank cohort. To clarify this point, we have added the following sentence to the methods section (pp 10-11): "To mitigate the risk of overfitting and to ensure the robustness of our findings, PRSs for the UK Biobank cohort were specifically calculated using GWAS meta-analysis results that excluded UK Biobank samples ('leave_23andme_and_UKBB'): "B1_ALL_leave_23andme_and_UKBB" [12,455 cases vs. 61,144 controls] and "B2_ALL_leave_23andme_and_UKBB" [40,929 cases vs. 1,924,400 controls]. In contrast, the PRS for the other two cohorts were based on GWAS that included UK Biobank samples."

Second, I agree with the authors that the inconsistency between EUR and non-EUR results can be caused by the small sample size of the non-EUR data and low transferability across populations. I would like to further point out that for traits that are influenced by many confounding factors, like getting COVID during the pandemic, the PRS transferability across populations can be even lower than for traits mainly caused by biological or genetic factors. The COVID GWAS used in this study is mainly based on EUR samples. The PRS based on this GWAS is very likely to be heavily influenced by EUR-specific factors (not only LD structure, but also other confounding factors existing only in the EUR populations) and therefore cannot sufficiently represent the likelihood of getting COVID or the severity of COVID in non-EUR populations. This could be another reason for the signals from the AFR and EUR samples being negatively correlated.
We appreciate the reviewer's insightful comments highlighting the potential limitations of PRS applicability across different populations. Considering this, we have amended the Discussion section (pp 25-26) to address these concerns more directly: "Secondly, we did not assess the predictive performance of the COVID-19 severity PRS, as is usually recommended for newly developed PRS [67], due to a lack of well-characterized COVID-19 cases/severity and small sample sizes. Instead, we relied only on the discovery GWAS and the applied PRS method, i.e., any biases or confounding in the underlying GWAS may have also biased the resulting PRS. In particular, the predictive accuracy of PRS is likely diminished for non-European individuals due to the GWAS being based primarily on European samples, where EUR-specific environmental and socio-economic factors, in addition to genetic factors, may significantly influence COVID-19 severity. Thirdly, our approach did not work for non-European subsets, which could be due to their substantially smaller sample sizes and the well-established lack of transportability of PRS across diverse populations [68]. This underscores the need to establish larger, more diverse populations to expand the investigation of a COVID-19 severity PRS to a broader group of individuals. Finally, we did not account for selection bias in the three cohorts, which could explain some of the heterogeneity we observed in the meta-analysis. For example, MGI is a hospital-based cohort enriched for patients undergoing surgery [46], and UKB is a population-based cohort that was reported to have a "healthy volunteer" selection bias [69]. At the same time, All of Us has purposefully oversampled certain underrepresented subgroups [44,70]. While many of our PheWAS results align with previous reports, moving forward, it is imperative to include and analyze more representative samples of non-European populations and to apply ancestry-aware PRS methods to improve the accuracy and applicability of PRS PheWAS in diverse ancestry groups."

1.3. I also agree with the authors that the correlation between the PRS and the actual severity of COVID could not be tested with the current data due to the existing poor phenotyping of COVID. Therefore, it would be great if the authors could provide other proxies of COVID severity to support the findings based on the COVID PRS.
We acknowledge the reviewer's point regarding the limitation imposed by poor phenotyping of COVID-19 severity in our dataset. As suggested, we agree that the inclusion of alternative severity proxies could enrich the analysis. While our current dataset restricts us to hospitalization status, we advocate for future studies to encompass a more nuanced array of clinical endpoints, such as mechanical ventilation, ICU admission rates, and the presence of specific immune biomarkers. This recommendation has been added to the discussion section to guide subsequent research (p 25): "Future research should strive for a more standardized definition of COVID-19 severity, incorporating additional proxies such as mechanical ventilation requirements, ICU admissions, or specific immune biomarkers, to improve the evaluation of severity PRS models and facilitate cross-study comparisons."

In conclusion, while the authors' work addressed an important topic of the relationships between COVID severity PRS and various phenotypic traits, the study would greatly benefit from addressing the aforementioned concerns to fortify the overall robustness and reliability of its findings.

Reviewer #2
Report: Uncovering Associations between Pre-existing Conditions and COVID-19 Severity: A Polygenic Risk Score Approach Across Three Large Biobanks
The main contribution of this paper is that the authors investigated the use of polygenic risk scores (PRS) as reliable proxies of COVID-19 severity across three large biobanks, the Michigan Genomics Initiative (MGI), UK Biobank (UKB), and NIH All of Us, to identify associations between pre-existing conditions and COVID-19 severity. By utilizing PRS as a proxy for COVID-19 severity, the authors identified known risk factors and novel associations between pre-existing clinical phenotypes and COVID-19 severity.
We thank Reviewer #2 for their insightful feedback on our manuscript. Their suggestions have been instrumental in refining our study.

The authors performed the analysis stratified by biobank due to the varying sampling strategies in these biobanks. It would be great to include details in the paper, such as how they are not comparable with each other.
We agree that it's important to clarify that the three biobanks we've analyzed recruit their participants in very different ways.The MGI is based in a hospital setting and tends to include individuals with specific health issues [PMID: 36819667].The UK Biobank's participants, despite being randomly selected from the population, are generally healthier than the average person in the UK.This discrepancy is known and could lead to a lower rate of reported health issues

Would it be possible to describe the phenotypic categories in each bank separately with the help of a plot?
The reviewer raised an important point, which we have addressed by creating a new bar plot that compares the prevalence of phenotype categories across the three cohorts. This visualization can be found in

In addition to the table, it would be great to show the phenotype associations (PheWAS results) with the help of a plot such as a forest plot.
Thank you for the suggestion. To complement the supplementary tables, we added forest plots of the significant associations of the meta-analysis to the supplementary material (Fig R2
We have now added details to our methods section to include information on participants and provided a concise summary of the underlying analysis for the COVID-19 severity and susceptibility GWAS meta-analysis (pp 10-11): "We downloaded the GWAS meta-analysis summary statistics on COVID-19 severity from the COVID-19 Host Genetics Initiative (COVID19-hg GWAS meta-analyses round 7; release date: April 8, 2022; also see Web Resources). We considered summary statistics from two GWAS meta-analyses: (1) "B1_ALL": hospitalized COVID-19 versus not hospitalized COVID-19 ("B1_ALL_leave_23andme" [16,512 cases vs. 71,321 controls]) and (2) "B2_ALL": hospitalized COVID-19 versus population controls ("B2_ALL_leave_23andme" [44,986 cases vs. 2,356,386 controls]). To mitigate the risk of overfitting and to ensure the robustness of our findings, PRSs for the UK Biobank cohort were specifically calculated using GWAS meta-analysis results that excluded UK Biobank samples ('leave_23andme_and_UKBB'): "B1_ALL_leave_23andme_and_UKBB" [12,455 cases vs. 61,144 controls] and "B2_ALL_leave_23andme_and_UKBB" [40,929 cases vs. 1,924,400 controls]. In contrast, the PRS for the other two cohorts were based on GWAS that included UK Biobank samples. The underlying meta-analyses utilized a standard association model, including covariates for age, sex, the first 20 principal components (PCs), and study-specific technical covariates, excluding heritable risk factors and comorbidities. Each contributing cohort conducted GWAS under this framework, employing the SAIGE software [PMID: 30104761] to account for relatedness and case-control imbalance. For a comprehensive account of the participant demographics and individual study contributions, see Table S1 in the supplementary material, which lists sample sizes and ancestry data for the "B1_ALL" meta-analysis."
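For readers less familiar with how GWAS summary statistics feed into a PRS: once per-SNP weights have been derived from the summary statistics, an individual's score is, in the usual formulation, the weighted sum of their effect-allele dosages. The following is only a minimal sketch of that general idea; the SNP identifiers, weights, and dosages are hypothetical and are not taken from the study, and the function name is introduced purely for illustration.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages.

    dosages: dict mapping SNP id -> effect-allele dosage (0 to 2) for one person
    weights: dict mapping SNP id -> per-allele weight (e.g. a shrunken log odds ratio)
    SNPs without a weight are ignored.
    """
    return sum(weights[snp] * dose for snp, dose in dosages.items() if snp in weights)


# Hypothetical weights and genotypes, for illustration only.
example_weights = {"snp_a": 0.10, "snp_b": -0.20, "snp_c": 0.05}
example_person = {"snp_a": 2, "snp_b": 1, "snp_d": 1}  # snp_d has no weight

score = polygenic_score(example_person, example_weights)
print(f"PRS = {score:.3f}")
```

In practice, scores computed this way are often standardized within a cohort before being used as a predictor; whether and how that is done depends on the analysis pipeline.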

Even though PRS-CS is a well-known PRS development technique, it would be good to include a short summary of the algorithm in the paper.
We added a summary of the PRS-CS algorithm to the methods section (p 11): "We used the software package "PRS-CS" [37] to define PRS weights based on a Bayesian regression framework employing continuous shrinkage (CS) priors. Briefly, PRS-CS adjusts the SNPs' effect sizes to account for their associations with the trait of interest and the local LD patterns,
[PMID: 28641372]. The All of Us Research Program uses a mix of open invitations and partnerships with healthcare provider organizations to ensure a diverse and representative set of participants [PMID: 31412182]. We added the following to the Discussion section (p 22): "In examining the distribution of diagnoses across various categories of diseases in unrelated European ancestry cohorts from hospital-based (MGI), population-based (UKB), and the All of Us cohorts, certain patterns emerge, as illustrated in Fig S14. Generally, the MGI cohort exhibits a higher proportion of affected individuals across all categories, reflective of its hospital-based nature [PMID: 36819667]. In contrast, the UKB data, representing a population-based sample, consistently reports lower diagnosis rates, especially for congenital anomalies [PMID: 28641372]. The All of Us cohort demonstrates intermediate values reflective of their recruitment, a mix of open invitations and partnerships with healthcare provider organizations [PMID: 31412182]. These observations highlight the variability in health condition prevalence across different cohort types and underscore the importance of considering the cohort source and recruitment strategies when interpreting disease frequency data."

Fig R3.1 Scatter plot demonstrating the SNP effect of smoking initiation on COVID-19 severity (B1 ALL). Each dot represents one of 117 SNPs, with the SNP effect on smoking initiation plotted on the x-axis and the SNP effect on COVID-19 severity depicted on the y-axis. The vertical and horizontal lines through each dot correspond to the confidence intervals for the SNP effects. Mendelian Randomization (MR) estimates from different methods are illustrated as solid lines across the scatter plot: light blue for the Inverse Variance Weighted method, green for the Weighted Median method, blue for the MR Egger method, red for the Weighted Mode method, and light green for the Simple Mode method. The slope of each line corresponds to that method's estimate of the causal effect of smoking initiation on COVID-19 severity.

Fig R3.2 Scatter plot demonstrating the SNP effect of cigarettes smoked per day on COVID-19 severity (B1 ALL). Each dot represents one of 82 SNPs, with the SNP effect on cigarettes smoked per day plotted on the x-axis and the SNP effect on COVID-19 severity depicted on the y-axis. The vertical and horizontal lines through each dot correspond to the confidence intervals for the SNP effects. Mendelian Randomization (MR) estimates from different methods are illustrated as solid lines across the scatter plot: light blue for the Inverse Variance Weighted method, green for the Weighted Median method, blue for the MR Egger method, red for the Weighted Mode method, and light green for the Simple Mode method. The slope of each line corresponds to that method's estimate of the causal effect of cigarettes smoked per day on COVID-19 severity.

Fig R3.3 Scatter plot demonstrating the SNP effect of smoking initiation on COVID-19 susceptibility (B2 ALL). Each dot represents one of 116 SNPs, with the SNP effect on smoking initiation plotted on the x-axis and the SNP effect on COVID-19 susceptibility depicted on the y-axis. The vertical and horizontal lines through each dot correspond to the confidence intervals for the SNP effects. Mendelian Randomization (MR) estimates from different methods are illustrated as solid lines across the scatter plot: light blue for the Inverse Variance Weighted method, green for the Weighted Median method, blue for the MR Egger method, red for the Weighted Mode method, and light green for the Simple Mode method. The slope of each line corresponds to that method's estimate of the causal effect of smoking initiation on COVID-19 susceptibility.

Fig R3.4 Scatter plot demonstrating the SNP effect of cigarettes smoked per day on COVID-19 susceptibility (B2 ALL). Each dot represents one of 83 SNPs, with the SNP effect on cigarettes smoked per day plotted on the x-axis and the SNP effect on COVID-19 susceptibility depicted on the y-axis. The vertical and horizontal lines through each dot correspond to the confidence intervals for the SNP effects. Mendelian Randomization (MR) estimates from different methods are illustrated as solid lines across the scatter plot: light blue for the Inverse Variance Weighted method, green for the Weighted Median method, blue for the MR Egger method, red for the Weighted Mode method, and light green for the Simple Mode method. The slope of each line corresponds to that method's estimate of the causal effect of cigarettes smoked per day on COVID-19 susceptibility.
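To make the lines in the scatter plots above more concrete, the sketch below shows how the Inverse Variance Weighted (IVW) estimate can be computed from per-SNP summary statistics under a fixed-effect model: it is the slope of a weighted regression of the SNP-outcome effects on the SNP-exposure effects, constrained through the origin. This is a generic illustration, not the analysis code used for the figures; the function names and the three example SNP effects are hypothetical, and the other methods shown in the figures (Weighted Median, MR Egger, mode-based estimators) are not implemented here.

```python
import math


def wald_ratios(beta_exposure, beta_outcome):
    """Per-SNP causal estimates: outcome effect divided by exposure effect."""
    return [by / bx for bx, by in zip(beta_exposure, beta_outcome)]


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and its standard error.

    Equivalent to a weighted regression through the origin of outcome
    effects on exposure effects, with weights 1 / se_outcome**2.
    """
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx ** 2 for w, bx in zip(weights, beta_exposure))
    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error


# Hypothetical SNP effects (log scale), for illustration only.
bx = [0.10, 0.20, 0.30]   # SNP effects on the exposure
by = [0.05, 0.10, 0.15]   # SNP effects on the outcome
se = [0.02, 0.02, 0.02]   # standard errors of the outcome effects

slope, slope_se = ivw_estimate(bx, by, se)
print(f"IVW slope = {slope:.3f} (SE {slope_se:.3f})")
```

In the figures, each plotted line has a slope equal to the corresponding method's estimate; agreement between the slopes across methods is commonly read as a sign of robustness to violations of the instrument assumptions.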

.2 and R2.3, Fig S7 and S8).
Forest plots of the BMI-unadjusted association between the PRS for COVID-19 severity and various pre-pandemic phenotypes. Each of the 27 panels (A-AA) represents a phenotype that reached phenome-wide significance in the meta-analysis. Each panel is labeled with its description and phecode. For each phenotype, odds ratios (ORs) and 95% confidence intervals (CIs) are shown for the MGI, UKB, All of Us studies, and the overall meta-analysis. The ORs and 95% CIs are also numerically represented on the right side of each plot. The vertical dashed line represents an OR of 1. The I² statistic, Q statistic, and P-value for heterogeneity are shown for the meta-analysis.
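As a reading aid for the quantities shown in these forest plots, the following sketch illustrates a standard inverse-variance fixed-effects meta-analysis of per-cohort log odds ratios, together with Cochran's Q and the I² statistic. It is a generic textbook formulation written for illustration, not the code used for the analysis; the function name and the example inputs are hypothetical, and the heterogeneity P-value (a chi-square tail probability for Q with k-1 degrees of freedom) is omitted to keep the example dependency-free.

```python
import math


def fixed_effects_meta(log_ors, ses):
    """Pool per-cohort log odds ratios with inverse-variance weights.

    log_ors: per-cohort log odds ratios
    ses:     their standard errors
    Returns the pooled OR with a 95% CI, Cochran's Q, and I² (in percent).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(log_ors) - 1
    # I²: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "or": math.exp(pooled),
        "ci_low": math.exp(pooled - 1.96 * pooled_se),
        "ci_high": math.exp(pooled + 1.96 * pooled_se),
        "q": q,
        "i2": i2,
    }


# Hypothetical log ORs and standard errors for three cohorts.
result = fixed_effects_meta([0.05, 0.08, 0.02], [0.02, 0.01, 0.03])
print(result)
```

Because the weights are the inverse variances, the largest and most precisely estimated cohort dominates the pooled estimate, which is worth keeping in mind when one cohort contributes most of the sample.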

Panel Z: Abdominal pain (phecode 785)
Forest plots of the BMI-adjusted association between the PRS for COVID-19 severity and various pre-pandemic phenotypes. Each of the 10 panels (A-J) represents a phenotype that reached phenome-wide significance in the meta-analysis adjusted for BMI. Each panel is labeled with its description and phecode. For each phenotype, odds ratios (ORs) and 95% confidence intervals (CIs) are shown for the MGI, UKB, All of Us studies, and the overall meta-analysis. The ORs and 95% CIs are also numerically represented on the right side of each plot. The vertical dashed line represents an OR of 1. The I² statistic, Q statistic, and P-value for heterogeneity are shown for the meta-analysis.

Please include the details of the summary statistics on COVID-19 severity, such as the details of the participants, a summary of the analysis, etc.
PRS that more accurately reflects the complex genetic architecture of a trait. It does not require individual-level data but integrates GWAS summary statistics with a provided, precomputed, ancestry-specific LD reference panel for up to 1,117,425 common, non-ambiguous, autosomal SNPs based on samples of the UK Biobank (see Web Resources). We opted for PRS-CS because it has demonstrated superior performance to other PRS methods, likely attributable to its adaptable modeling assumptions [55]."

2.6. Please include in the discussion section how this paper is different from similar papers in the literature, for example from approaches that use only a PRS with known loci.
Unlike most studies where COVID-19 risk SNPs have been used as weak instruments [PMID: 34655949], in our research, a PRS serves as a robust proxy for the severity of COVID-19. This is particularly significant given the availability of accurate PRS data for a large number of individuals, contrasting with the often poor quality or incompleteness of COVID-19 outcome data. By broadly capturing genetic variations related to COVID-19 outcomes, our PRS expands risk prediction capabilities beyond what is possible with analyses restricted to known loci like ACE2 and TMPRSS2. This agnostic approach aligns with the aims of initiatives such as the COVID-19 Host Genetics Initiative, which seeks to discover genetic factors impacting patient outcomes [PMID: 32393819], underscoring the value of wide-ranging genetic investigations in understanding disease risks and informing clinical decisions."

We employed Mendelian Randomization (MR), a statistical method that uses genetic variants as instrumental variables, to infer potential causal relationships between smoking-related traits and COVID-19 susceptibility. Genetic instruments for our exposures (smoking initiation and cigarettes per day) were sourced from genome-wide association study (GWAS) datasets for smoking initiation (https://conservancy.umn.edu/bitstream/handle/11299/201564/SmokingInitiation.txt.gz) and for